Ryan St.Pierre (ras70) Dakota Brinkman (dlb46) Matt Olson (meo8) Jeremy Schreck (jes85)
The biological question we are trying to answer is: “How do robust metabolic cycles within yeast cells respond under continuous, glucose limited conditions?”
Particularly, we aim to determine which clusters of genes control the metabolic process, which TF/chromatin modifiers control their state, and how the interaction between clusters, through the transcription regulatory network, contribute to this robust metabolic process. Our primary paper attempts to correlate yeast chromatin state and gene expression during the metabolic process. Our biological question has not changed significantly, and we attempt to understand the metabolic cycle within yeast cells with this current homework. In the first homework we investigated what gene expression after periods of starvation reveals about the function of the cell, particularly the metabolic process. In this homework we explore the TFs that control the genes and try to determine how the TFs interact with genes to create complex functionality. In addition, we aim to better tie the effect of the chromatin state, which is obviously tied to the chromatin modifiers, into our analysis and conclusions. We feel that our understanding of how the chromatin modifiers contribute to the process was lacking in the prior homework, which was a consequence of us conducting a gene clustering analysis and not a TF analysis. Thus, our biological question has been extended slightly, to further analyze the TF network (particularly the chromatin modifiers) and determine how the network contributes to the metabolic process and whether or not TFs regulate genes within similar clusters.
Additionally, we aim to derive more meaningful conclusions about which timepoints in the data contribute to the phases found in the paper (RC, OX, RB, RC). We wish to provide more specificity on the delay between these phases and potentially the mechanism driving them. We believe a transcription focused analysis will be instrumental in getting close in answering some of these questions
From our clustering analysis we were able to determine that there are ~10 clusters in the given expression data, a couple of which were extremely well correlated (temporally). This led us to believe there are 3 or 4 superclusters of significant size (~100-150 genes each) that are instrumental in the metabolic process, a conclusion well supported by the literature provided to us. Additionally, we were able to determine that several clusters display similar oscillatory patterns, with a slight phase (delay) difference. This suggests a “just-in-time supply chain” paradigm, in which components from specific processes “become available in a highly correlated manner” Abstract. This is well supported by the findings in the paper, which outlines several phases of the metabolic process: RC, OX, RB, RC. In this homework we seek to further investigate how the TF network contributes to this highly coordinated process.
The TFs that we have been given, in the primary dataset, for this analysis are Gcn5p, Esa1p and Set1p. These are three critical chromatin modifiers, which means they alter the state of the chromatin, allowing other TFs to bind to the DNA, leading to gene expression. The format of this dataset is explained at depth in Question 2. Additionally, we were given a broader TF dataset, which is investigated at the end of this analysis and whose investigation will eventually be overlaid with our work with the chromatin modifiers.
We were given TF binding data for 3 TFs (Gcn, Esa, and Set) in the “kuang-2014-TFbinding.csv”" file. In order to analyze the format of the data we first imported it.
kuangtf <- read.csv("./datasets/kuang-2014-TFbinding.csv")
time <- read.csv("./datasets/kuang-2014-TFbinding-timepts.csv")
kuangrna <- read.csv("./datasets/kuang-2014-rnaseq-expression.csv") ##used later - included in the import section of the code for clarity
head(kuangtf[,1:10])
## Gene.ID Gcn5_1 Gcn5_2 Gcn5_3 Gcn5_4 Gcn5_5
## 1 YAL012W -0.4480247 -0.1089212 -0.20125827 0.2555246 3.147586
## 2 YBL033C 0.3881028 0.3413688 0.17312982 1.7363590 -1.626094
## 3 YBL039C 0.3723152 -0.5988378 -0.08978121 0.7222359 1.648957
## 4 YBL039C-A 0.3723152 -0.5988378 -0.08978121 0.7222359 1.648957
## 5 YBL053W -0.7128402 -0.6348585 -0.78821392 0.9645165 2.102227
## 6 YBL054W -0.7128402 -0.6348585 -0.78821392 0.9645165 2.102227
## Gcn5_6 Gcn5_7 Gcn5_8 Gcn5_9
## 1 0.2977793 0.65587384 -0.002432699 -0.3711936
## 2 0.6026766 0.08901389 -0.332481559 -0.6363833
## 3 1.7260275 -0.52176450 -1.010713667 -1.0700133
## 4 1.7260275 -0.52176450 -1.010713667 -1.0700133
## 5 -1.4582317 0.14640975 -0.851153951 0.7996889
## 6 -1.4582317 0.14640975 -0.851153951 0.7996889
head(time)
## hour
## 1 94.95000
## 2 95.31667
## 3 95.41667
## 4 95.47222
## 5 95.59722
## 6 95.74167
dim(kuangtf)
## [1] 1035 43
dim(time)
## [1] 14 1
From the output above it is clear that our data is time series without knockout. For each TF there are 14 columns, each of which correspond to a time in the kuang-2014-TFbinding-timepts comma-separated data file. The rows of the dataset correspond to specific genes. Thus, each cell of the data gives the binding of the given TF and its time-period (according to the column) for a given gene (given by the row). For simplicity, this binding measure can be thought of how much TF there is for a given gene, or how much TF is bound to the given gene, at that timepoint. This can be thought of as a measure of enrichment over background, with the numbers representing how much TF is bound to a given gene relative to some control.
The rows of the dataset contain 1035 genes. We were curious why 1035 genes were given. After further investigation in the paper we found the following statement:
“Among these [genes in the entire genome], ~1,750 genes were bound by at least two modifiers, and 1,035 genes were bound by all three modifiers, thus suggesting that these enzymes were not necessary for expression of every cycling gene.” Landscape of occupancy of chromatin-modifying enzymes in the YMC
The statement above clarifies which TFs we were given: those in the yeast genome for which all 3 TFs (Gcn, Esa, and Set) did bind.
Since we were given time series data, as opposed to p-value data, answering the question “What threshold did the paper use?” was difficult. The three TFs given are already “known” to bind to the 1035 genes given (discussed above). This was determined through a spatiotemporal analysis. However, the paper is more concerned with trying to gain insight from analyzing the temporal attributes of the binding data. In more plain terms, the authors attempt to associate peak TF binding with a given gene’s expression. This type of analysis was done by directly comparing temporal data, and associating peaks, rather than choosing a threshold value. For this reason we cannot supply a concrete threshold value that was chosen by the paper. In our analysis we use correlation values quite extensively. Thus, it is required for us to have a threshold/cut-off correlation value. This is described in great depth in Question 3 below, where our analysis pipeline is described.
At this point we split the data (associating one data frame per TF) and reformatted the data. We took the transpose of the original data (such that genes become the columns and Tfs became the rows, with time incrementing down in a row-wise fashion). This allowed us to append the time series data to the data frame created. Reformatting of the data in this way was necessary for plotting and further analysis below.
Gcn <- kuangtf[,c(1:15)]
Esa <- kuangtf[,c(1,16:29)]
Set <- kuangtf[,c(1,30:43)]
Gcn.data <- as.data.frame(t(Gcn[,-1]))
colnames(Gcn.data) <- Gcn$Gene.ID
Gcn.time <- cbind(Gcn.data, time = time$hour)
Esa.data <- as.data.frame(t(Esa[,-1]))
colnames(Esa.data) <- Esa$Gene.ID
Esa.time <- cbind(Esa.data, time = time$hour)
Set.data <- as.data.frame(t(Set[,-1]))
colnames(Set.data) <- Set$Gene.ID
Set.time <- cbind(Set.data, time = time$hour)
head(Set.time[,1030:1036])
## YPR030W YPR036W-A YPR065W YPR149W YPR184W
## Set1_1 -0.26440620 -0.33674229 0.5824306 0.5318525 -0.8051170
## Set1_2 -0.65807795 -0.20414164 0.8714891 0.2790511 0.4435870
## Set1_3 -0.07715693 -0.07814229 1.2403187 -0.5818363 0.5868122
## Set1_4 -0.45375589 -0.23528077 2.3568545 -1.3657076 0.7774554
## Set1_5 -1.16253443 -1.58610722 -1.0290589 -1.5386282 1.3501832
## Set1_6 -0.12295147 -0.50561580 -0.3577519 -1.3848435 0.5683965
## YPR193C time
## Set1_1 0.41569527 94.95000
## Set1_2 0.06475232 95.31667
## Set1_3 0.16686641 95.41667
## Set1_4 -0.33227175 95.47222
## Set1_5 -1.38128892 95.59722
## Set1_6 -1.21659685 95.74167
At this point we designed our analysis pipeline. However, the data given to us was not in a familiar format (a format seen in class). This forced us to do some original exploration of the data provided before coming up with a comprehensive approach. This exploration is shown below, after which our pipeline is described.
Before we procceeded we plotted binding data, over time, for different genes. We overlayed the data from the 3 TFs on the same plot, shown below. We created these plots to get a feeling for the binding data, as an original exploration exercise. We found that the binding data followed oscillatory patterns, all slightly different for each gene. This reminded us of the gene expression data, which also oscillated in a similar way.
Additionally, these plots gave us some insight on how the TFs for a given gene behave with respect to each other. The plots for the 3 TFs for a given gene are somewhat overlayed on top of each other, peaking at similar times. Of course, the plots for the TFs are not perfectly correlated, especially for some of the genes. However, the findings do highlight the importance of the peaking behavior.
ggplot() + geom_line(data=Gcn.time, aes(x = time, y = YAL012W, color="Gcn")) + geom_line(data=Set.time, aes(x = time, y = YAL012W, color="Set")) + geom_line(data=Esa.time, aes(x = time, y = YAL012W, color="Esa")) + labs(x = "Time (hours)", y = "Binding of Gene", title = "YAL001C")
ggplot() + geom_line(data=Gcn.time, aes(x = time, y = YBL033C, color="Gcn")) + geom_line(data=Set.time, aes(x = time, y = YBL033C, color="Set")) + geom_line(data=Esa.time, aes(x = time, y = YBL033C, color="Esa")) + labs(x = "Time (hours)", y = "Binding of Gene", title = "YBL033C")
ggplot() + geom_line(data=Gcn.time, aes(x = time, y = YBL039C, color="Gcn")) + geom_line(data=Set.time, aes(x = time, y = YBL039C, color="Set")) + geom_line(data=Esa.time, aes(x = time, y = YBL039C, color="Esa")) + labs(x = "Time (hours)", y = "Binding of Gene", title = "YBL039C")
ggplot() + geom_line(data=Gcn.time, aes(x = time, y = YBL053W, color="Gcn")) + geom_line(data=Set.time, aes(x = time, y = YBL053W, color="Set")) + geom_line(data=Esa.time, aes(x = time, y = YBL053W, color="Esa")) + labs(x = "Time (hours)", y = "Binding of Gene", title = "YBL053W")
ggplot() + geom_line(data=Gcn.time, aes(x = time, y = YBL087C, color="Gcn")) + geom_line(data=Set.time, aes(x = time, y = YBL087C, color="Set")) + geom_line(data=Esa.time, aes(x = time, y = YBL087C, color="Esa")) + labs(x = "Time (hours)", y = "Binding of Gene", title = "YBL087C")
ggplot() + geom_line(data=Gcn.time, aes(x = time, y = YER086W, color="Gcn")) + geom_line(data=Set.time, aes(x = time, y = YER086W, color="Set")) + geom_line(data=Esa.time, aes(x = time, y = YER086W, color="Esa")) + labs(x = "Time (hours)", y = "Binding of Gene", title = "YER086W")
ggplot() + geom_line(data=Gcn.time, aes(x = time, y = YJL045W, color="Gcn")) + geom_line(data=Set.time, aes(x = time, y = YJL045W, color="Set")) + geom_line(data=Esa.time, aes(x = time, y = YJL045W, color="Esa")) + labs(x = "Time (hours)", y = "Binding of Gene", title = "YJL045W")
ggplot() + geom_line(data=Gcn.time, aes(x = time, y = YGL236C, color="Gcn")) + geom_line(data=Set.time, aes(x = time, y = YGL236C, color="Set")) + geom_line(data=Esa.time, aes(x = time, y = YGL236C, color="Esa")) + labs(x = "Time (hours)", y = "Binding of Gene", title = "YGL236C")
These plots gave us insight that the peaking of the TFs might be driving gene expression, or at least they are related. This drove us to a “similarity” or correlation based approach described below.
Intro
From the time plots above and the paper we thought we could find which genes are regulated by which TF(s) by doing a temporal analysis. We reasoned that a TF must regulate a gene if that gene has a peak in expression and binding at a similar time interval. Our reasoning for such is that for a given gene to be expressed the chromatin must be accessable. The given TFs regulate the “opening” and “closing” of the chromatin (they are chromatin modifiers). Thus, it is likely to hold that a given TF will bind heavily at a similar time interval of high gene expression, such that it opens the chromatin state and allows other TFs to bind to lead to high gene expression.
At this point we manually plotted binding data and gene expression over time for several genes. We repeated this process until we found some genes for which the peaks corresponded. This validated the legitimacy of our correlation approach. Note: These manually created plots were not included here to avoid redundancy. They are given in Question 9, where we believe the graphs are more fitting. Please reference this question.
Design
However, we found it would be cumbersome to attempt to match up peaks between expression and binding for all the given genes manually (by looking at plots). Thus, we aimed to come up with an analysis pipeline that intuitively found the relationship in a more automated manner. We believed a correlation based analysis would be the best way to do this. Our goal was to mimick the p-value approach. However, instead of filtering on p-value we filted on the correlation between expression and binding.
The first step in our design was choosing what exactly we wanted to correlate. As stated above we reasoned that for a given gene it is regulated by a TF if the binding data for the TF peaks at the same time as the gene expression. Thus, we wanted to correlate the TF binding data and gene expression for a given gene. This result, we reasoned, would give us a value for how likely the TF regulates the given gene.
Next in the pipeline we had to chose a filtering threshold. We chose a correlation value of 0.75 to distinguish which TFs regulated which genes. At first the chosing of this correlation value was quite arbitrary. In a sense we were “inventing the wheel” for this type of analysis and thus had no point of reference for a good starting point. We tried different correlation thresholds values throughout the writeup and the changes impact are noted in the location of occurence. Overall, as expected increasing the correlation threshold (closer to one) lowers the number of genes that each TF regulates. Additionally, increasing the threshold increases the likelyhood a gene is bound (here we use the term loosely to mean regulated) by two or all TFs. Lowering the threshold has the converse effect. Eventually the value of 0.75 was chosen because we felt it preserved the integrity of the data (eliminated false positives) and get the TF gene groupings relativelty small (~20-50 genes). We thought it was important to keep the TF binding groups small for the sake of meaningful analysis. Much of our network analysis, done below, is visual. Thus, with groups too large the visuals become crowded and difficult to read. We also tried increasing the correlation value. However, this created groups too small. For example, a correlation value of 0.8 created groupings in which one TF was disjoint from the other (did not regulate any of the genes from the other TFs). We reasoned that given the complexity of the network this was not a practical result.
The final step in the pipeline analysis was a basic network analysis. This entailed creating multiple directional graphs that describe the TF network. This is described in much detail below in the remaining questions.
Note: We identified two limitations of our approach: 1. Our pipeline assumes that if a gene’s TF binding data and expression data are strongly correlated they have matching peaks. These peaks is what were really interested. The strong correlation actually implies the two functions follow the same pattern, but the peaks can still be slightly off. For our analysis we reasoned that the assumption that correlated functions have the same peaks (in time) was a fair one to make. We investigated some of these well correlated functions later in our analysis and verified this small assumption. 2. The paper states that the chromatin modifiers and genes are not perfectly, simply correlated. In other words, there might be slight delays between peaks. However, we needed some starting point. We investigated some of these delays manually in the succeeding work.
The success of our pipeline is discussed in greater detail in Question 9, specifically when we investigate the gene expression and TF binding functions for a given gene side by side.
To analyze the genes that were regulated by a specific TF, we needed the RNA seq data. This data contained gene expression over 16 time points for all the genes. The first 14 time points matched the 14 time points from the TF data. Then we filtered the genes so all genes from the TF binding data had a matching gene from the RNA seq data. With this data, we created a correlation algorithm that determined how similar the expression and binding data was for a specific gene. We had to come up with this analysis completely on our own, and we are very happy with the results.
First, we created a dataframe with genes down the rows and expression data over time moving right across the columns. Next, we made a dataframe for each TF containing genes down the rows and binding data over time moving right across the columns. Being careful to make sure the gene columns matched up across these datasets we were able to calculate row wise correlation values for the gene expression data and gene binding data for a given TF.
The following code makes sure the data is in a format in which the rows can be directly mapped (the same) between the two dataframe described above.
#get the genes in the tf and rna data
TF.genes <- unique(kuangtf[,1])
rna.genes <- unique(kuangrna[,1])
#get rid of duplicated rows in the rna data
rna.unique <- unique(kuangrna)
#filter the rna data for the tf genes
rna.Genes <- filter(rna.unique, rna.unique[,1] %in% TF.genes)
#get the unique genes from this rna data
#may not need to do this again but did it just to be safe
rna.genes.u <- unique(rna.Genes[,1])
#It appears that all of the genes present in the TF data are not present in the rna data
#So need to filter again to get rid of these genes
kuangtf.new <- filter(kuangtf, TF.genes %in% rna.genes.u)
#now remove last two columns so the times add up
rna.new <- rna.Genes[,1:15]
dim(rna.new)
## [1] 966 15
dim(kuangtf.new)
## [1] 966 43
#now extract the different TFs
Gcn.new <- kuangtf.new[,c(1:15)]
dim(Gcn.new)
## [1] 966 15
Esa.new <- kuangtf.new[,c(1,16:29)]
dim(Esa.new)
## [1] 966 15
Set.new <- kuangtf.new[,c(1,30:43)]
dim(Set.new)
## [1] 966 15
#here we see the 69 genes that are in the TF data but not the rnaseq
length(which( ! TF.genes %in% rna.genes))
## [1] 69
#sort so that the genes are in the same order
Gcn.sorted <- Gcn.new[match(rna.new[,1], Gcn.new[,1]),]
Esa.sorted <- Esa.new[match(rna.new[,1], Esa.new[,1]),]
Set.sorted <- Set.new[match(rna.new[,1], Set.new[,1]),]
#preview
head(Gcn.sorted[,1:5])
## Gene.ID Gcn5_1 Gcn5_2 Gcn5_3 Gcn5_4
## 602 YAL003W 0.4210547 0.03921831 0.26878654 -1.22088738
## 603 YAL005C 0.4210547 0.03921831 0.26878654 -1.22088738
## 1 YAL012W -0.4480247 -0.10892122 -0.20125827 0.25552458
## 309 YAL013W -1.4122189 0.36154599 -0.24499443 -0.37735420
## 310 YAL020C -0.4501603 -0.67188929 -0.35009015 0.05787682
## 311 YAL021C 0.2242571 0.40294404 -0.06985582 0.15622274
head(Esa.sorted[,1:5])
## Gene.ID Esa1_1 Esa1_2 Esa1_3 Esa1_4
## 602 YAL003W 0.5797392 -0.81897569 -0.4128622 -1.8789875
## 603 YAL005C 0.1936854 -0.07163007 -0.6965876 -1.2663688
## 1 YAL012W -0.7628393 -0.43493956 -0.3507037 0.7034366
## 309 YAL013W -0.2664357 -0.49891799 -0.5468844 -0.9179981
## 310 YAL020C 0.2784185 -0.75261164 -1.4577396 0.1280955
## 311 YAL021C 0.2579206 -1.32855840 -1.3170291 -0.4971314
head(Set.sorted[,1:5])
## Gene.ID Set1_1 Set1_2 Set1_3 Set1_4
## 602 YAL003W -0.05706616 0.2252152 0.04715545 0.81900077
## 603 YAL005C 0.68438294 0.3425939 -0.13277217 -1.45068627
## 1 YAL012W -0.47354488 -0.5062939 0.89063299 1.10092792
## 309 YAL013W -0.30476058 -0.7171564 -0.74920966 -0.31134056
## 310 YAL020C -0.49396996 -1.0305297 -0.56247226 -0.78453701
## 311 YAL021C -0.28577606 -0.2225978 -0.73650211 0.01989187
head(rna.new[,1:5])
## Gene.name RNAseq1 RNAseq2 RNAseq3 RNAseq4
## 1 YAL003W 0.1499670 0.27219633 0.32033973 0.9181278
## 2 YAL005C 1.1067015 0.66791225 -0.01165236 -1.6360337
## 3 YAL012W -0.6172133 -0.03536448 0.56992820 1.6315391
## 4 YAL013W -0.1990946 -0.09576321 -0.19909460 -0.7412416
## 5 YAL020C -0.3296119 -0.86638324 0.11116839 -0.1179260
## 6 YAL021C 0.2582430 0.82327898 -0.09441158 0.6129605
At this point we performed the row-wise correlation and appended the results into one dataframe called df.cors.
##Need to drop gene name
Gcn.sorted.cor <- select(Gcn.sorted, -Gene.ID)
Esa.sorted.cor <- select(Esa.sorted, -Gene.ID)
Set.sorted.cor <- select(Set.sorted, -Gene.ID)
rna.new.cor <- select(rna.new, -Gene.name)
corValues.Gcn <- sapply(1:nrow(Gcn.sorted.cor), function(i) cor(as.numeric(Gcn.sorted.cor[i,]), as.numeric(rna.new.cor[i,])))
corValues.Esa <- sapply(1:nrow(Esa.sorted.cor), function(i) cor(as.numeric(Esa.sorted.cor[i,]), as.numeric(rna.new.cor[i,])))
corValues.Set <- sapply(1:nrow(Set.sorted.cor), function(i) cor(as.numeric(Set.sorted.cor[i,]), as.numeric(rna.new.cor[i,])))
df.cors <- data.frame(corValues.Gcn, corValues.Esa, corValues.Set)
head(df.cors)
## corValues.Gcn corValues.Esa corValues.Set
## 1 -0.36046925 -0.6178680 0.51229765
## 2 0.46909131 0.5667060 0.66575267
## 3 0.55020422 0.5351141 0.59075382
## 4 0.04204583 0.2804874 -0.09863229
## 5 0.55418612 0.6498429 -0.18210032
## 6 0.02022176 0.3406007 0.40254016
Filtering based on the threshold decided:
# But first lets append the gene names to the df.cors dataframe for more informative display
df.cors.w.names <- df.cors
row.names(df.cors.w.names) <- Gcn.sorted$Gene.ID #doesn't matter what set the names comes from - they are all the same
head(df.cors.w.names)
## corValues.Gcn corValues.Esa corValues.Set
## YAL003W -0.36046925 -0.6178680 0.51229765
## YAL005C 0.46909131 0.5667060 0.66575267
## YAL012W 0.55020422 0.5351141 0.59075382
## YAL013W 0.04204583 0.2804874 -0.09863229
## YAL020C 0.55418612 0.6498429 -0.18210032
## YAL021C 0.02022176 0.3406007 0.40254016
Gcn.indexes <- which(df.cors.w.names[,1] > .75)
Gcn.genes.above <- kuangtf.new[Gcn.indexes, 1]
Esa.indexes <- which(df.cors[,2] > .75)
Esa.genes.above <- kuangtf.new[Esa.indexes, 1]
Set.indexes <- which(df.cors[,3] > .75)
Set.genes.above <- kuangtf.new[Set.indexes, 1]
#print first 10 genes for each TF that meet threshold
head(df.cors.w.names[Gcn.indexes,1, drop = FALSE], 10)
## corValues.Gcn
## YBR010W 0.8391033
## YBR196C 0.7870837
## YCR012W 0.8264415
## YCR061W 0.7526478
## YDR154C 0.8655134
## YDR155C 0.8387431
## YDR224C 0.9580741
## YDR225W 0.7525263
## YEL063C 0.7809656
## YGR086C 0.8576394
head(df.cors.w.names[Esa.indexes,2, drop = FALSE], 10)
## corValues.Esa
## YBL075C 0.7765671
## YDR154C 0.7664506
## YEL024W 0.7814870
## YHL004W 0.7590135
## YJL210W 0.9275056
## YJR095W 0.8311625
## YKR012C 0.7892278
## YKR013W 0.7767776
## YLL041C 0.8253553
## YLR004C 0.7599581
head(df.cors.w.names[Set.indexes,3, drop = FALSE], 10)
## corValues.Set
## YBL042C 0.7927463
## YBR010W 0.8594544
## YBR078W 0.7594112
## YBR105C 0.8697967
## YCL037C 0.8044635
## YCR012W 0.8540802
## YDL125C 0.7535042
## YDR050C 0.7852305
## YDR154C 0.8601331
## YDR155C 0.8246246
Based on the result of the code below the following is clear (given correlation threshold = 0.75): * ESA regulates 19 genes * GCN regulates 26 genes * SET regualtes 44 genes
df.cors["gene"] = Set.sorted["Gene.ID"]
df.cors.single <- df.cors %>%
tidyr::gather(TF, corr, -gene) %>% # cast to long format
dplyr::select(TF, gene, corr)
df.cors.single.trimmed <- filter(df.cors.single, corr > 0.75)
hits.per.tf <- df.cors.single.trimmed %>%
group_by(TF) %>%
summarise(ngenes = length(TF))
hits.per.tf
## # A tibble: 3 × 2
## TF ngenes
## <chr> <int>
## 1 corValues.Esa 19
## 2 corValues.Gcn 26
## 3 corValues.Set 44
Below we changed the threshold values (to 0.8 and 0.5) to observe its effect on the group sizes.
df.cors.single.trimmed.8 <- filter(df.cors.single, corr > 0.8)
hits.per.tf.8 <- df.cors.single.trimmed.8 %>%
group_by(TF) %>%
summarise(ngenes = length(TF))
df.cors.single.trimmed.5 <- filter(df.cors.single, corr > 0.5)
hits.per.tf.5 <- df.cors.single.trimmed.5 %>%
group_by(TF) %>%
summarise(ngenes = length(TF))
hits.per.tf.8
## # A tibble: 3 × 2
## TF ngenes
## <chr> <int>
## 1 corValues.Esa 7
## 2 corValues.Gcn 14
## 3 corValues.Set 18
hits.per.tf.5
## # A tibble: 3 × 2
## TF ngenes
## <chr> <int>
## 1 corValues.Esa 183
## 2 corValues.Gcn 304
## 3 corValues.Set 302
We wanted to go a step further than just investigating how many genes each TF bound. We wanted to see how many genes were bound by several or all of the TFs. To do this we created a histogram of this distribution, below. From this histogram we can conclude that at the correlation threshold of 0.75 most of the genes are only bound by one TF. We decided that this conclusion is neither justified nor refuted by our knowledge. Also note we tried dropping the correlation value pretty significantly to 0.5 to see its effect on the distribution. The distribution did seem to spread out in this case, as more genes were bound by multiple TFs. However, the change was not a drastic as we thought it would be, relative to the large drop in the threshold.
tfs.per.gene <- df.cors.single.trimmed %>%
group_by(gene) %>%
summarise(ntfs = length(TF))
tfs.per.gene.5 <- df.cors.single.trimmed.5 %>%
group_by(gene) %>%
summarise(ntfs = length(TF))
ggplot(tfs.per.gene, aes(x = ntfs)) +
geom_histogram(binwidth = 1) +
labs(x = "Number of TFs", y = "Number of Genes",
title = "Distribution of number of TFs per gene (corr threshold = 0.75)")
ggplot(tfs.per.gene.5, aes(x = ntfs)) +
geom_histogram(binwidth = 1) +
labs(x = "Number of TFs", y = "Number of Genes",
title = "Distribution of number of TFs per gene (corr threshold = 0.5)")
See the network plots below. We will discuss these networks in 6.
df.cors.graph <- graph_from_data_frame(df.cors.single.trimmed)
vcount(df.cors.graph)
## [1] 75
ecount(df.cors.graph)
## [1] 89
tfs <- unique(df.cors.single.trimmed$TF)
V(df.cors.graph)$type <- V(df.cors.graph)$name %in% tfs
V(df.cors.graph)$shape <- c("square", "circle")[V(df.cors.graph)$type+1]
V(df.cors.graph)$color <- c("steelblue", "lightcoral")[V(df.cors.graph)$type+1]
count_components(df.cors.graph)
## [1] 1
df.cors.components <- decompose(df.cors.graph)
lapply(df.cors.components, vcount)
## [[1]]
## [1] 75
lapply(df.cors.components, ecount)
## [[1]]
## [1] 89
plot(df.cors.components[[1]],
layout = layout.auto,
vertex.label.cex=0.4,
vertex.label.degree=-90, vertex.label.dist=1)
df.cors.projection <- bipartite.projection(df.cors.graph)
gene.proj <- df.cors.projection[[1]]
tf.proj <- df.cors.projection[[2]]
tf.shared.cut <- delete.edges(tf.proj, E(tf.proj)[weight < 1])
plot(tf.shared.cut, vertex.label.cex=0.5, vertex.label.dist=0.5 )
tfs.only.graph <- graph_from_data_frame(df.cors.single.trimmed, directed = TRUE)
l <- layout.circle(tfs.only.graph)
plot(tfs.only.graph, edge.arrow.size=0.3, vertex.shape="none", vertex.label.cex=0.3, edge.curved=0.1, vertex.label.font=2, vertex.label.color="gray40", edge.color="gray85", layout=l)
For the following please note that we are referencing the plots from Q5
In the undirected graph (the first graph), genes are connected to 1, 2, or 3 TFs. Not all the genes in the TF binding data are represented in this network because we filtered out the genes with correlation values less than 0.75. The genes that are inclued can be seen by a “venn-diagram structure” as shown above.
In the undirected graph, the edges represent connections between TFs and genes. Tfs and genes that were connected by an edge had TF binding data and RNA seq data that was highly correlated. The nodes represent genes (blue square) and TFs (salmon circle).
The directed graph (the third graph) is similar, but the edge points from TF to gene because the TF appears to regulate the gene. The nodes in the directed graph represent genes and TFs, which are distinguishable by the name.
For both graphs, there are 3 TFs and 75 target genes. There are 89 edges, meaning that some genes are regulated by multiple TFs.
The second graph shows that all three TFs are connected because each TF is connected to at least one similar target gene.
The out degree for TFs Esa, Gcn, and Set are 19, 26, and 44 respectively. For the in degree, look at the histogram above. About 58 genes have in-degree 1, about 13 genes have in-degree 2, and about 4 genes have in-degree 3. This shows that about 58 genes, 13 genes, and 4 genes are regulated by only 1, 2, and 3 TFs repectively.
#head(df.cors.single.trimmed)
all.profiles <- gprofiler(df.cors.single.trimmed,
organism = "scerevisiae",
max_p_value = 0.05, # set p-value cutoff
src_filter = "GO:BP", # only consider GO biological process
hier_filtering = "moderate")
all.profiles
## query.number significant p.value term.size query.size overlap.size
## 1 gene TRUE 8.19e-08 35 70 9
## 2 gene TRUE 8.19e-08 35 70 9
## 3 gene TRUE 3.12e-06 141 70 13
## 4 gene TRUE 3.18e-02 462 70 16
## 5 gene TRUE 3.13e-05 31 70 7
## 6 gene TRUE 1.37e-06 132 70 13
## 7 gene TRUE 2.48e-07 197 70 16
## 8 gene TRUE 2.63e-02 111 70 8
## recall precision term.id domain subgraph.number
## 1 0.129 0.257 GO:0019319 BP 2
## 2 0.129 0.257 GO:0006094 BP 6
## 3 0.186 0.092 GO:0009150 BP 11
## 4 0.229 0.035 GO:0055085 BP 8
## 5 0.100 0.226 GO:0006096 BP 13
## 6 0.186 0.098 GO:0046128 BP 5
## 7 0.229 0.081 GO:0006091 BP 4
## 8 0.114 0.072 GO:0045333 BP 12
## term.name relative.depth
## 1 hexose biosynthetic process 1
## 2 gluconeogenesis 1
## 3 purine ribonucleotide metabolic process 1
## 4 transmembrane transport 1
## 5 glycolytic process 1
## 6 purine ribonucleoside metabolic process 1
## 7 generation of precursor metabolites and energy 1
## 8 cellular respiration 1
## intersection
## 1 YBR105C,YBR196C,YCR012W,YDR050C,YGR192C,YGR254W,YJL052W,YKL060C,YOL126C
## 2 YBR105C,YBR196C,YCR012W,YDR050C,YGR192C,YGR254W,YJL052W,YKL060C,YOL126C
## 3 YBR196C,YCR012W,YDR050C,YEL024W,YGR192C,YGR254W,YJL052W,YKL060C,YML075C,YMR217W,YNL052W,YOR065W,YPL036W
## 4 YBL042C,YBL075C,YEL024W,YEL063C,YER056C,YJL210W,YJR095W,YKL120W,YLR004C,YLR237W,YNL052W,YNL125C,YOL126C,YOR065W,YOR328W,YPL036W
## 5 YBR196C,YCR012W,YDR050C,YGR192C,YGR254W,YJL052W,YKL060C
## 6 YBR196C,YCR012W,YDR050C,YEL024W,YGR192C,YGR254W,YJL052W,YKL060C,YML075C,YMR217W,YNL052W,YOR065W,YPL036W
## 7 YBR196C,YCR012W,YDR050C,YEL024W,YER177W,YGR192C,YGR254W,YJL052W,YJR104C,YKL060C,YLL041C,YLR174W,YNL052W,YOL126C,YOR065W,YPL262W
## 8 YEL024W,YJR104C,YLL041C,YLR174W,YNL052W,YOL126C,YOR065W,YPL262W
To synthesize information about our network to think about the function of the TF(s) of interest, we decided to run a GO analysis. Specifically, we ran a GO analysis on only the trimmed dataset (the genes that we previously found were correlated with the TFs and therefore are regulated by the TFs) using gProfileR. The GO analyis split the genes up into 7 categories. These categories had various functions, with a common theme being that they were related to metabolism and the yeast metabolic cycle. This suggests that the function of the target genes is to regulate the metabolic cycle. Specific functional categories that are enriched among the target genes include “generation of precursor metabolites and energy”, “purine ribonucleotide metabolic process”, and “purine ribonucleoside monophosphate metabolic process”, and “tricarboxylic acid cycle”.
We only used 3 TFs in our network analysis. Therefore, all the edges generated were directly between thsoe TFs and the genes of interest, so every gene in our dataset was either a direct target of the TF (i.e. if the gene was included in the trimmed data), or not a target at all. We would need additional TF data to determine if some of the genes are indirect targets of the 3 TFs of interest.
Next we aimed to discover whether or not any of the TFs regulated the other TFs or themselves. To do this, for each TF we searched its regulated genes in an attempt to discover if these genes where the gene names of the other TFs. The corresponding mapping of TF to gene name is given below: * Gcn5 -> YGR252W * Esa1 -> YOR244W * Set1 -> YHR119W
Gcn5 regulation of itself and other genes
gcn.reg.genes <- as.character(Gcn.genes.above)
gcn.reg.genes
## [1] "YER091C" "YHR119W" "YLR004C" "YLR175W" "YPL274W" "YPR008W" "YBL042C"
## [8] "YBL072C" "YKL006W" "YDL033C" "YDR388W" "YEL066W" "YIL041W" "YJR009C"
## [15] "YKL146W" "YKL210W" "YNL031C" "YOR082C" "YBL074C" "YBR126C" "YDR508C"
## [22] "YGR267C" "YJR104C" "YLL056C" "YOL081W" "YOR324C"
"YGR252W" %in% gcn.reg.genes # Gcn5 not in data overall
## [1] FALSE
"YOR244W" %in% gcn.reg.genes #Esa1
## [1] FALSE
"YHR119W" %in% gcn.reg.genes # Set1
## [1] TRUE
From the work above it was found that Gcn5 does not regulate itself or Esa1, but does regulate Set1. However, we did some digging into the original dataset and discovered that YGR252W and YOR244W, corresponding to Gcn5 and Esa1 respectively, were not included. Thus, it is not necessarily the case that Gcn5 does not regulate itself or Esa1, but rather than no conclusion can be reached based on its lack of inclusion in the expression data.
Esa regulation of itself and other genes
esa.reg.genes <- as.character(Esa.genes.above)
esa.reg.genes
## [1] "YDR433W" "YPL274W" "YIL052C" "YER039C-A" "YLR023C"
## [6] "YLR349W" "YOR082C" "YOR083W" "YPL031C" "YPL265W"
## [11] "YBL075C" "YGR267C" "YGR269W" "YHR117W" "YIL100W"
## [16] "YLL056C" "YLR280C" "YNL194C" "YPL057C"
"YGR252W" %in% esa.reg.genes # Gcn5 not in data overall
## [1] FALSE
"YOR244W" %in% esa.reg.genes #Esa1 not in data overall
## [1] FALSE
"YHR119W" %in% esa.reg.genes # Set1
## [1] FALSE
Set1 regulation of itself and other genes
set.reg.genes <- as.character(Set.genes.above)
set.reg.genes
## [1] "YDL131W" "YER091C" "YGL234W" "YGR122C-A" "YKL056C"
## [6] "YLR004C" "YMR307W" "YOR236W" "YPL274W" "YPR008W"
## [11] "YBL042C" "YBL072C" "YLR388W" "YLR441C" "YOL040C"
## [16] "YCL058C" "YDR207C" "YDR388W" "YEL066W" "YGL122C"
## [21] "YGR204W" "YHR042W" "YKL210W" "YLR315W" "YLR375W"
## [26] "YNL117W" "YNL278W" "YPR093C" "YCR025C" "YDR096W"
## [31] "YEL033W" "YGR050C" "YGR267C" "YHR117W" "YJL100W"
## [36] "YJR104C" "YLR094C" "YLR216C" "YLR281C" "YNL274C"
## [41] "YOL081W" "YOR229W" "YOR230W" "YPL156C"
"YGR252W" %in% set.reg.genes # Gcn5 not in data overall
## [1] FALSE
"YOR244W" %in% set.reg.genes #Esa1 not in data overall
## [1] FALSE
"YHR119W" %in% set.reg.genes # Set1
## [1] FALSE
For the reason above no conclusions were reached about Esa1 and Set1 regulation of Gcn5 and Esa1. From the work above it was concluded that Esa1 does not regulate Set1 and Set1 does not regulate itself. In other words Set1 is not autoregulated.
harb <- read.table(
"./datasets/harbison-2004-TFbinding-YPD.txt",
sep="\t", header=TRUE,fill = TRUE)
head(harb[1:15])
## X X.1 X.2
## 1 YAL001C TFC3 TFIIIC (transcription initiation factor) subunit, 138 kD
## 2 YAL002W VPS8 vacuolar sorting protein, 134 kD
## 3 YAL003W EFB1 translation elongation factor eEF1beta
## 4 YAL004W YAL004W strong similarity to A.klebsiana glutamate dehydrogenase
## 5 YAL005C SSA1 heat shock protein of HSP70 family, cytosolic
## 6 YAL007C ERP2 p24 protein involved in membrane trafficking
## A1..MATA1._YPD ABF1_YPD ABT1_YPD ACA1_YPD ACE2_YPD ADR1_YPD AFT2_YPD
## 1 0.0687 0.656 0.3240 0.4620 0.523 0.0627 0.471
## 2 NaN NaN 0.8180 0.2430 0.714 0.7310 0.874
## 3 NaN NaN 0.4700 0.9410 0.299 0.7500 0.422
## 4 0.8360 0.648 0.7550 0.3320 0.489 0.1810 0.861
## 5 NaN NaN 0.4700 0.9410 0.299 0.7500 0.422
## 6 0.5430 0.107 0.0141 0.0706 0.888 0.1740 0.411
## ARG80_YPD ARG81_YPD ARO80_YPD ARR1_YPD ASH1_YPD
## 1 0.1850 0.0234 0.120 0.1120 0.822
## 2 0.0715 0.0332 0.316 0.1060 0.394
## 3 0.3310 0.2940 0.703 0.9230 0.671
## 4 0.5430 0.3800 0.528 0.0486 0.356
## 5 0.3310 0.2940 0.703 0.9230 0.671
## 6 0.5320 0.4080 0.435 0.8860 0.639
harb.tfs <- harb %>%
dplyr::select(-X.1, -X.2) %>%
dplyr::rename(target = X)
harb.tfs.long <- harb.tfs %>%
tidyr::gather(TF, pvalue, -target) %>% # cast to long format
dplyr::select(TF, target, pvalue)
harb.trimmed <- filter(harb.tfs.long, pvalue < 0.001)
head(harb.trimmed)
## TF target pvalue
## 1 A1..MATA1._YPD YBL018C 3.72e-04
## 2 ABF1_YPD YAL023C 3.59e-04
## 3 ABF1_YPD YAL041W 3.91e-04
## 4 ABF1_YPD YAL042W 1.34e-04
## 5 ABF1_YPD YAL043C 1.34e-04
## 6 ABF1_YPD YAL053W 8.07e-05
hits.per.tf <- harb.trimmed %>%
group_by(TF) %>%
summarise(ngenes = length(target))
tfs.per.gene <- harb.trimmed %>%
group_by(target) %>%
summarise(ntfs = length(TF))
ggplot(tfs.per.gene, aes(x = ntfs)) +
geom_histogram(binwidth = 1) +
labs(x = "Number of TFs", y = "Number of Genes",
title = "Distribution of number of TFs per gene")
set.overlap <- intersect(as.character(harb.trimmed$target),set.reg.genes)
set.overlap
## [1] "YBL072C" "YBL042C"
harb.trimmed[which(harb.trimmed$target==set.overlap),]
## TF target pvalue
## 73 FHL1_YPD YBL072C 2.04e-07
## 212 MSN1_YPD YBL042C 7.13e-05
esa.overlap <- intersect(as.character(harb.trimmed$target),esa.reg.genes)
esa.overlap
## [1] "YPL265W" "YBL075C"
harb.trimmed[which(harb.trimmed$target==esa.overlap),]
## TF target pvalue
## 31 ARG80_YPD YPL265W 3.32e-04
## 65 DAL81_YPD YPL265W 4.07e-05
## 336 SNT2_YPD YBL075C 7.92e-10
## 357 STP1_YPD YPL265W 4.41e-04
gcn.overlap <- intersect(as.character(harb.trimmed$target),gcn.reg.genes)
gcn.overlap
## [1] "YBL072C" "YBL042C" "YBL074C"
harb.trimmed[which(harb.trimmed$target==gcn.overlap),]
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(harb.trimmed$target, gcn.overlap): longer object
## length is not a multiple of shorter object length
## TF target pvalue
## 73 FHL1_YPD YBL072C 2.04e-07
## 212 MSN1_YPD YBL042C 7.13e-05
"YMR042W" %in% set.reg.genes
## [1] FALSE
"YMR042W" %in% gcn.reg.genes
## [1] FALSE
"YMR042W" %in% esa.reg.genes
## [1] FALSE
Above, we furthered our analysis with the harbison data. This data provided standard pvalue relationships for genes and TFs. We trimmed the dataset based on genes that had a pvalue less than 0.001 and then displayed the distribution of the distribution of number of TFs per gene. Lastly, we wanted to tie this analysis to our previous TF clustering analysis. We looked to see in the harbison trimmed dataset if there were any genes present in our kuang trimmed dataset. There were 7 genes that overlapped. We looked at the yeast database and typed in the common name for the corresponding TF. From this site, we found the corresponding long name. Then, we checked the original kuang dataset to see if that gene existed. Only “YMR042W” (which was from ARG80 TF) existed in the large kuang dataset, but it was not in any of the original kuang clusters. Our hope was that some of the genes from this analysis would be present in the clusters. This would have shown how the genes regulated themselves, but we were not able to draw those kinds of conclusions.
In this section we attempted to answer the question: “Do genes regulated by the TF fall into the same cluster, based on our previous clustering analysis?”
In order to answer this question we investigated three different relationships for each TF: 1. TF binding for genes in the TF cluster 2. RNAseq expression binding for genes in the TF cluster 3. Gene expression and binding data vs time for a given gene in the TF cluster
Before this we needed to do a little bit of set-up, shown below.
Gcn.indexes <- which(df.cors[,1] > .75)
Gcn.genes.above <- df.cors[Gcn.indexes, 4]
Esa.indexes <- which(df.cors[,2] > .75)
Esa.genes.above <- df.cors[Esa.indexes, 4]
Set.indexes <- which(df.cors[,3] > .75)
Set.genes.above <- df.cors[Set.indexes, 4]
We manually added the time data (not too many points)
timechar <- c(
"94.95",
"95.31666667",
"95.41666667",
"95.47222222",
"95.59722222",
"95.74166667",
"95.98333333",
"96.24166667",
"96.35555556",
"96.55833333",
"97.05833333",
"97.55833333",
"98.05833333",
"98.56666667")
time <- as.numeric(timechar)
Gcn5 Analysis
TF Binding
Gcn.genes.above.plot <- as.character(Gcn.genes.above)
Gcn.above <- filter(Gcn.new, Gene.ID %in% Gcn.genes.above.plot)
Gcn.t <- t(Gcn.above)
Gcn.t <- Gcn.t[2:nrow(Gcn.t),]
length(Gcn.genes.above.plot)
## [1] 26
colnames(Gcn.t) <- as.vector(Gcn.genes.above.plot)
Gcn.t.time <- cbind(Gcn.t, time = time)
Gcn.long <- gather(data.frame(Gcn.t.time), gene, binding, -time)
## Warning: attributes are not identical across measure variables; they will
## be dropped
Gcn.long$time <- as.numeric(as.character(Gcn.long$time))
Gcn.long$binding <- as.numeric(as.character(Gcn.long$binding))
Gcn.long %>% filter(gene %in% Gcn.genes.above.plot[1:10]) %>%
ggplot(aes(x=time, y=binding, color=gene)) + ggtitle("Gcn5 TFbinding for genes with cor > .75") + geom_line()
Above we plotted the Gcn5 binding for the first 10 genes regulated by Gcn5. Some of the genes show similar peak behavior, but this is not a strong relationship. We concluded that the above plot is somewhat expected. Even though the genes are regulated by the same TF does not necessarily mean they demonstrate the same TF binding pattern.
Gene expression
Gcn.rna <- filter(rna.new, Gene.name %in% Gcn.genes.above.plot)
Gcn.rna.t <- t(Gcn.rna)
newGenes <- Gcn.rna.t[1,]
Gcn.rna.t2 <- Gcn.rna.t[2:nrow(Gcn.rna.t),]
colnames(Gcn.rna.t2) <- as.vector(newGenes)
Gcn.rna.time <- cbind(Gcn.rna.t2, time = time)
Gcn.rna.long <- gather(data.frame(Gcn.rna.time), gene, expression, -time)
## Warning: attributes are not identical across measure variables; they will
## be dropped
Gcn.rna.long$time <- as.numeric(as.character(Gcn.rna.long$time))
Gcn.rna.long$expression <- as.numeric(as.character(Gcn.rna.long$expression))
Gcn.rna.long %>% filter(gene %in% Gcn.genes.above.plot[1:10]) %>%
ggplot(aes(x=time, y=expression, color=gene)) + ggtitle("Gcn5 RNAseq for genes with cor > .75") + geom_line()
The above plot shows gene expression over time for all the genes regulated by Gcn5. Again, there is not necessarily any expected behavior that this plot should have followed given our approach. However, the plot is insightful in answering the question stated at the start of this section. If these genes expressed similar expression over time it would support the claim that they are all in the same gene cluster. This seems to be the case for most of the genes shown, as they show a strong decline together around time 95.5. This suggested to us that Gcn5 might regulate genes that fall into the same cluster. We manually locate which genes fall in which cluster below (following the plots).
Gene Binding vs Expression
gene.of.interest <- Gcn.genes.above.plot[1]
p1 <- Gcn.long %>% filter(gene %in% gene.of.interest) %>%
ggplot(aes(x=time, y=binding)) + ggtitle("TFbinding") + geom_line(colour="red")
p2 <- Gcn.rna.long %>% filter(gene %in% gene.of.interest) %>%
ggplot(aes(x=time, y=expression)) + ggtitle("RNAseq Expression") + geom_line(colour="red")
grid.arrange(p1, p2, ncol=2, top=textGrob(as.character(Gcn.genes.above.plot[1]), gp=gpar(fontface="bold")))
gene.of.interest <- Gcn.genes.above.plot[2]
p1 <- Gcn.long %>% filter(gene %in% gene.of.interest) %>%
ggplot(aes(x=time, y=binding)) + ggtitle("TFbinding") + geom_line(colour="red")
p2 <- Gcn.rna.long %>% filter(gene %in% gene.of.interest) %>%
ggplot(aes(x=time, y=expression)) + ggtitle("RNAseq Expression") + geom_line(colour="red")
grid.arrange(p1, p2, ncol=2, top=textGrob(as.character(Gcn.genes.above.plot[2]), gp=gpar(fontface="bold")))
gene.of.interest <- Gcn.genes.above.plot[3]
p1 <- Gcn.long %>% filter(gene %in% gene.of.interest) %>%
ggplot(aes(x=time, y=binding)) + ggtitle("TFbinding") + geom_line(colour="red")
p2 <- Gcn.rna.long %>% filter(gene %in% gene.of.interest) %>%
ggplot(aes(x=time, y=expression)) + ggtitle("RNAseq Expression") + geom_line(colour="red")
grid.arrange(p1, p2, ncol=2, top=textGrob(as.character(Gcn.genes.above.plot[3]), gp=gpar(fontface="bold")))
The plots above show side by side graphs of gene expression and TF binding for the first three genes in the Ecn5 group. These plots suggest that our correlation evaluation was succcessful. The binding and expression functions peak at similar times as desired.
Esa1 and Set1
We found similar results for Esa1 and Set1. For all of the genes in these TFs the binding was not well correlated, while the expression was somewhat correlated. Again we saw, through analyzing the side by side plots of expression and binding, that our correlation method was relatively successful.
Esa1
Esa.genes.above.plot <- as.character(Esa.genes.above)
Esa.above <- filter(Esa.new, Gene.ID %in% Esa.genes.above.plot)
Esa.t <- t(Esa.above)
Esa.t <- Esa.t[2:nrow(Esa.t),]
length(Esa.genes.above.plot)
## [1] 19
colnames(Esa.t) <- as.vector(Esa.genes.above.plot)
Esa.t.time <- cbind(Esa.t, time = time)
Esa.long <- gather(data.frame(Esa.t.time), gene, binding, -time)
## Warning: attributes are not identical across measure variables; they will
## be dropped
Esa.long$time <- as.numeric(as.character(Esa.long$time))
Esa.long$binding <- as.numeric(as.character(Esa.long$binding))
Esa.long %>% filter(gene %in% Esa.genes.above.plot[1:10]) %>%
ggplot(aes(x=time, y=binding, color=gene)) + ggtitle("Esa1 TFbinding for genes with cor > .75") + geom_line()
Esa.rna <- filter(rna.new, Gene.name %in% Esa.genes.above.plot)
Esa.rna.t <- t(Esa.rna)
newGenes <- Esa.rna.t[1,]
Esa.rna.t2 <- Esa.rna.t[2:nrow(Esa.rna.t),]
colnames(Esa.rna.t2) <- as.vector(newGenes)
Esa.rna.time <- cbind(Esa.rna.t2, time = time)
Esa.rna.long <- gather(data.frame(Esa.rna.time), gene, expression, -time)
## Warning: attributes are not identical across measure variables; they will
## be dropped
Esa.rna.long$time <- as.numeric(as.character(Esa.rna.long$time))
Esa.rna.long$expression <- as.numeric(as.character(Esa.rna.long$expression))
Esa.rna.long %>% filter(gene %in% Esa.genes.above.plot[1:10]) %>%
ggplot(aes(x=time, y=expression, color=gene)) + ggtitle("Esa1 RNAseq for genes with cor > .75") + geom_line()
gene.of.interest <- Esa.genes.above.plot[1]
p1 <- Esa.long %>% filter(gene %in% gene.of.interest) %>%
ggplot(aes(x=time, y=binding)) + ggtitle("TFbinding") + geom_line(colour="blue")
p2 <- Esa.rna.long %>% filter(gene %in% gene.of.interest) %>%
ggplot(aes(x=time, y=expression)) + ggtitle("RNAseq Expression") + geom_line(colour="blue")
grid.arrange(p1, p2, ncol=2, top=textGrob(as.character(Esa.genes.above.plot[1]), gp=gpar(fontface="bold")))
gene.of.interest <- Esa.genes.above.plot[2]
p1 <- Esa.long %>% filter(gene %in% gene.of.interest) %>%
ggplot(aes(x=time, y=binding)) + ggtitle("TFbinding") + geom_line(colour="blue")
p2 <- Esa.rna.long %>% filter(gene %in% gene.of.interest) %>%
ggplot(aes(x=time, y=expression)) + ggtitle("RNAseq Expression") + geom_line(colour="blue")
grid.arrange(p1, p2, ncol=2, top=textGrob(as.character(Esa.genes.above.plot[2]), gp=gpar(fontface="bold")))
gene.of.interest <- Esa.genes.above.plot[3]
p1 <- Esa.long %>% filter(gene %in% gene.of.interest) %>%
ggplot(aes(x=time, y=binding)) + ggtitle("TFbinding") + geom_line(colour="blue")
p2 <- Esa.rna.long %>% filter(gene %in% gene.of.interest) %>%
ggplot(aes(x=time, y=expression)) + ggtitle("RNAseq Expression") + geom_line(colour="blue")
grid.arrange(p1, p2, ncol=2, top=textGrob(as.character(Esa.genes.above.plot[3]), gp=gpar(fontface="bold")))
Set1
Set.genes.above.plot <- as.character(Set.genes.above)
Set.above <- filter(Set.new, Gene.ID %in% Set.genes.above.plot)
Set.t <- t(Set.above)
Set.t <- Set.t[2:nrow(Set.t),]
length(Set.genes.above.plot)
## [1] 44
colnames(Set.t) <- as.vector(Set.genes.above.plot)
Set.t.time <- cbind(Set.t, time = time)
Set.long <- gather(data.frame(Set.t.time), gene, binding, -time)
## Warning: attributes are not identical across measure variables; they will
## be dropped
Set.long$time <- as.numeric(as.character(Set.long$time))
Set.long$binding <- as.numeric(as.character(Set.long$binding))
Set.long %>% filter(gene %in% Set.genes.above.plot[1:10]) %>%
ggplot(aes(x=time, y=binding, color=gene)) + ggtitle("Set1 TFbinding for genes with cor > .75") + geom_line()
Set.rna <- filter(rna.new, Gene.name %in% Set.genes.above.plot)
Set.rna.t <- t(Set.rna)
newGenes <- Set.rna.t[1,]
Set.rna.t2 <- Set.rna.t[2:nrow(Set.rna.t),]
colnames(Set.rna.t2) <- as.vector(newGenes)
Set.rna.time <- cbind(Set.rna.t2, time = time)
Set.rna.long <- gather(data.frame(Set.rna.time), gene, expression, -time)
## Warning: attributes are not identical across measure variables; they will
## be dropped
Set.rna.long$time <- as.numeric(as.character(Set.rna.long$time))
Set.rna.long$expression <- as.numeric(as.character(Set.rna.long$expression))
Set.rna.long %>% filter(gene %in% Set.genes.above.plot[1:10]) %>%
ggplot(aes(x=time, y=expression, color=gene)) + ggtitle("Set1 RNAseq for genes with cor > .75") + geom_line()
gene.of.interest <- Set.genes.above.plot[1]
p1 <- Set.long %>% filter(gene %in% gene.of.interest) %>%
ggplot(aes(x=time, y=binding)) + ggtitle("TFbinding") + geom_line(colour="green")
p2 <- Set.rna.long %>% filter(gene %in% gene.of.interest) %>%
ggplot(aes(x=time, y=expression)) + ggtitle("RNAseq Expression") + geom_line(colour="green")
grid.arrange(p1, p2, ncol=2, top=textGrob(as.character(Set.genes.above.plot[1]), gp=gpar(fontface="bold")))
gene.of.interest <- Set.genes.above.plot[4]
p1 <- Set.long %>% filter(gene %in% gene.of.interest) %>%
ggplot(aes(x=time, y=binding)) + ggtitle("TFbinding") + geom_line(colour="green")
p2 <- Set.rna.long %>% filter(gene %in% gene.of.interest) %>%
ggplot(aes(x=time, y=expression)) + ggtitle("RNAseq Expression") + geom_line(colour="green")
grid.arrange(p1, p2, ncol=2, top=textGrob(as.character(Set.genes.above.plot[4]), gp=gpar(fontface="bold")))
gene.of.interest <- Set.genes.above.plot[3]
p1 <- Set.long %>% filter(gene %in% gene.of.interest) %>%
ggplot(aes(x=time, y=binding)) + ggtitle("TFbinding") + geom_line(colour="green")
p2 <- Set.rna.long %>% filter(gene %in% gene.of.interest) %>%
ggplot(aes(x=time, y=expression)) + ggtitle("RNAseq Expression") + geom_line(colour="green")
grid.arrange(p1, p2, ncol=2, top=textGrob(as.character(Set.genes.above.plot[3]), gp=gpar(fontface="bold")))
Once we confirmed that the genes regulated by a certain TF seem to have similar expression patterns, and thus are likely to fall into the same clusters, we checked how many genes for a given TF actually fit into our k-means clusters (found in the previous homework).
For clarity we brought some of the clustering code into this homework (for refernce and “Knitting”).
rnaseq <- read.csv("./datasets/kuang-2014-rnaseq-expression.csv") #local needs to be changed
rnatime <- read.csv("./datasets/kuang-2014-rnaseq-expression-timepts.csv")
# first remember the names
nrna <- rnaseq$Gene.name
# transpose all but the first column (Gene.name)
rnaseq_t_zeros <- as.data.frame(t(rnaseq[,-1]))
colnames(rnaseq_t_zeros) <- nrna
rnaseq_t_zeros.time <- cbind(rnaseq_t_zeros, time = rnatime$hour)
rnaseq_t <- rnaseq_t_zeros[, colSums(rnaseq_t_zeros != 0) > 0]
rnaseq_t.time <- cbind(rnaseq_t, time = rnatime$hour)
rnaseq.cor <- cor(rnaseq_t, use = "pairwise.complete.obs")
rnaseq.dist <- as.dist(1-rnaseq.cor)
rnaseq.tree <- hclust(rnaseq.dist, method = "complete")
rnaseq.dend <- as.dendrogram(rnaseq.tree)
rnaseq.clusters <- cut(rnaseq.dend, h = 1.9)
rnaseq.clusters.cluster1 <- cut(rnaseq.clusters$lower[[1]], h = 1.5)
rnaseq.clusters.cluster1.clusterAgain <- cut(rnaseq.clusters.cluster1$lower[[1]], h = 1.2)
rnaseq.clusters.cluster2 <- cut(rnaseq.clusters$lower[[2]], h = 1.6)
rnaseq.clusters.cluster3 <- cut(rnaseq.clusters$lower[[3]], h = 1.3)
rnaseq.clusters.cluster3.cutAgain <- cut(rnaseq.clusters.cluster3$lower[[1]], h = .85)
rnaseq.clusters.cluster4 <- cut(rnaseq.clusters$lower[[4]], h = 1.48)
kuang <- read.csv("./datasets/kuang-2014-microarray-expression.csv") ##local- needs to be changed
kuang_t <- as.data.frame(t(kuang[,-1]))
n <- kuang$Gene.name # first remember the names
colnames(kuang_t) <- n
kuang_t.cor <- kuang_t %>% cor(use="pairwise.complete.obs") ##generate correlation measurements
kuang.dist <- as.dist(1 - kuang_t.cor) ##use correlation to find distance measurements
kuang.tree <- hclust(kuang.dist, method="complete")
kuang.dend <- as.dendrogram(kuang.tree) ## created & priviewed dendrogram object
k=3
kuang.kmedoids <- pam(kuang.dist, k) # create k-medoids clustering with k clusters
kclusters <- kuang.kmedoids$cluster
kclusters <- kclusters[order.dendrogram(kuang.dend)]
Once we successfully recreated the clusters we analyzed how the TF group of genes fell into these clusters.
kcluster1.genes <- names(kclusters[kclusters == 1])
kcluster2.genes <- names(kclusters[kclusters == 2])
kcluster3.genes <- names(kclusters[kclusters == 3])
in.1 <- kcluster1.genes[match(Gcn.genes.above.plot, kcluster1.genes)]
in.1.filter <-in.1[!is.na(in.1)]
in.2 <- kcluster2.genes[match(Gcn.genes.above.plot, kcluster2.genes)]
in.2.filter <- in.2[!is.na(in.2)]
in.3 <- kcluster3.genes[match(Gcn.genes.above.plot, kcluster3.genes)]
in.3.filter <- in.3[!is.na(in.3)]
in.1.filter
## [1] "YBR196C" "YCR012W" "YCR061W" "YDR154C" "YDR155C" "YGR086C" "YGR192C"
## [8] "YGR254W" "YJL052W" "YJL144W" "YJL153C" "YKL037W" "YKR012C" "YLR109W"
## [15] "YLR174W" "YML075C" "YOL126C" "YPL036W" "YPL171C"
in.2.filter
## [1] "YBR010W" "YDR224C" "YDR225W" "YIL123W" "YNL031C" "YNL289W"
in.3.filter
## [1] "YEL063C"
length(in.1.filter)
## [1] 19
length(in.2.filter)
## [1] 6
length(in.3.filter)
## [1] 1
In the above output the genes that are regulated by Gcn5 are printed, grouped by the gene cluster they fall in. It was found that 19 genes fall in cluster 1, 6 in cluster 2, and 1 in cluster 3. Thus, it was concluded that the genes that are binded by Gcn5 mostly fall in cluster 1, with a few genes in cluster 2.
length(Esa.genes.above.plot)
## [1] 19
Esa.1 <- kcluster1.genes[match(Esa.genes.above.plot, kcluster1.genes)]
Esa.1.filter <-Esa.1[!is.na(Esa.1)]
Esa.2 <- kcluster2.genes[match(Esa.genes.above.plot, kcluster2.genes)]
Esa.2.filter <- Esa.2[!is.na(Esa.2)]
Esa.3 <- kcluster3.genes[match(Esa.genes.above.plot, kcluster3.genes)]
Esa.3.filter <- Esa.3[!is.na(Esa.3)]
Esa.1.filter
## [1] "YBL075C" "YDR154C" "YJL210W" "YJR095W" "YKR012C" "YNL125C" "YOL126C"
## [8] "YOR083W" "YOR328W"
Esa.2.filter
## [1] "YEL024W" "YHL004W" "YKR013W" "YLL041C" "YLR121C" "YNL031C" "YNL052W"
## [8] "YPL262W"
Esa.3.filter
## [1] "YLR004C" "YNL112W"
length(Esa.1.filter)
## [1] 9
length(Esa.2.filter)
## [1] 8
length(Esa.3.filter)
## [1] 2
In the above output the genes that are regulated by Esa1 are printed, grouped by the gene cluster they fall in. It was found that 9 genes fall in cluster 1, 8 in cluster 2, and 2 in cluster 3. Thus, it was concluded that the genes that are binded by Esa1 are almost even split between cluster 1 and 2.
length(Set.genes.above.plot)
## [1] 44
Set.1 <- kcluster1.genes[match(Set.genes.above.plot, kcluster1.genes)]
Set.1.filter <-Set.1[!is.na(Set.1)]
Set.2 <- kcluster2.genes[match(Set.genes.above.plot, kcluster2.genes)]
Set.2.filter <- Set.2[!is.na(Set.2)]
Set.3 <- kcluster3.genes[match(Set.genes.above.plot, kcluster3.genes)]
Set.3.filter <- Set.3[!is.na(Set.3)]
Set.1.filter
## [1] "YCR012W" "YDL125C" "YDR050C" "YDR154C" "YDR155C" "YER177W" "YGR192C"
## [8] "YGR254W" "YHR087W" "YJL153C" "YJR073C" "YJR104C" "YKL060C" "YML126C"
## [15] "YNL208W" "YOL155C" "YPL036W" "YPL117C"
Set.2.filter
## [1] "YBR010W" "YBR078W" "YDR224C" "YDR225W" "YGR014W" "YLR023C" "YNL031C"
## [8] "YNL289W" "YOR065W"
Set.3.filter
## [1] "YBL042C" "YBR105C" "YCL037C" "YER056C" "YER062C" "YGR159C" "YHR179W"
## [8] "YIL053W" "YKL120W" "YLR237W" "YLR367W" "YMR217W" "YNL112W" "YOR091W"
## [15] "YOR361C" "YPL126W" "YPR065W"
length(Set.1.filter)
## [1] 18
length(Set.2.filter)
## [1] 9
length(Set.3.filter)
## [1] 17
In the above output the genes that are regulated by Set1 are printed, grouped by the gene cluster they fall in. It was found that 18 genes fall in cluster 1, 9 in cluster 2, and 17 in cluster 3. Thus, it was concluded that the genes that are binded by Set1 are likely to fall in cluster 1 or 3 (almost with the same occurances), but can still fall in cluster 2.
From the analysis above we concluded that genes regulated by a given TF are likely to fall into one or two clusters. However, we found that there is no means to justify that the relationship is exclusive, meaning genes regulated by a given TF do not fall into only one cluster.
We repeated the process for one hierarchical cluster, to see how different the results would be.
test <- labels(rnaseq.clusters.cluster1$lower[[1]])
in.1.h <- test[match(Gcn.genes.above.plot, test)]
in.1.h.filter <-in.1.h[!is.na(in.1.h)]
in.2.h <- test[match(Esa.genes.above.plot, test)]
in.2.h.filter <- in.2.h[!is.na(in.2.h)]
in.3.h <- test[match(Set.genes.above.plot, test)]
in.3.h.filter <- in.3.h[!is.na(in.3.h)]
length(in.1.h.filter)
## [1] 16
length(in.2.h.filter)
## [1] 7
length(in.3.h.filter)
## [1] 14
The above analysis shows the Gcn5 regulates 16 genes in cluster 1, while Esa1 regulates 7, and Set1 regulates 14. This again suggests that the relationship between TF and cluster is not perfectly exclusive. We have a larger number of hierarchical clusters and thus it would be too much to exhausitvely check and see how the genes are split across the clusters for a given TF. We believed that our k-means clusters are a better indicator (with k=3) since the paper uses a smaller amount of clusters.
Gcn.indexes.sup <- which(df.cors[,1] < (-.75))
Gcn.genes.above.sup <- df.cors[Gcn.indexes.sup, 4]
Gcn.genes.above.plot.sup <- as.character(Gcn.genes.above.sup)
Esa.indexes.sup <- which(df.cors[,2] < -.75)
Esa.genes.above.sup <- df.cors[Esa.indexes.sup, 4]
Esa.genes.above.plot.sup <- as.character(Esa.genes.above.sup)
Set.indexes.sup <- which(df.cors[,3] < -.75)
Set.genes.above.sup <- df.cors[Set.indexes.sup, 4]
Set.genes.above.plot.sup <- as.character(Set.genes.above.sup)
length(Gcn.genes.above.plot.sup)
## [1] 3
length(Esa.genes.above.plot.sup)
## [1] 2
length(Set.genes.above.plot.sup)
## [1] 0
Gcn.genes.above.plot.sup
## [1] "YJL115W" "YLR354C" "YPR103W"
Esa.genes.above.plot.sup
## [1] "YKL117W" "YML025C"
There are a lot fewer genes with a correlation lower than -0.75 than there are with a correlation higher than 0.75. If we say that a correlation above 0.75 means the TF strongly binds to a gene, then a correlation below -0.75 means the TF strongly supresses the gene. We can see that there were only 5 genes that were strongly supressed by a TF. This is interesting because there were 89 genes that were strongly bound by one of the three TFs.